SVMs and Kernels

In this notebook I'll fit a Support Vector Machine (SVM) classifier to data using scikit-learn, experiment with different kernels to see how they produce nonlinear decision surfaces, and finally predict labels for datapoints and measure the SVM's performance.

In [1]:
#Loading the appropriate packages
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import plotly.graph_objs as go
#from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
#init_notebook_mode(connected=True)

#turn off the scientific notation for floating point numbers.
np.set_printoptions(suppress=True)

Loading and examining the data

The data is from a CSV file.

This dataset is the breast cancer Wisconsin (diagnostic) dataset, which contains 30 different features computed from images of fine needle aspirates (FNA) of breast masses for 569 patients, with each example labeled as a benign or malignant mass.

  • This was taken and modified from the Machine Learning dataset repository of the School of Information and Computer Science, University of California, Irvine (UCI):

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [2]:
df = pd.read_csv('data_svms_and_kernels.csv')
df
Out[2]:
Feature 1 Feature 2 Label
0 0.109478 -0.168109 B
1 -0.590131 -0.153594 B
2 0.157117 0.059849 A
3 -0.623096 -0.638964 B
4 -0.296598 0.315015 B
... ... ... ...
287 -0.218594 0.302159 B
288 -0.283552 0.481753 B
289 0.299025 0.073648 A
290 0.489956 -0.092362 A
291 0.059036 -0.128766 B

292 rows × 3 columns

Now I'll extract the data from the dataframe into NumPy arrays and use LabelEncoder from scikit-learn to transform the labels into $\{-1,+1\}$:

In [3]:
X = df.drop('Label', axis=1).to_numpy()
y_text = df['Label'].to_numpy()
y = (2 * LabelEncoder().fit_transform(y_text)) - 1
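The mapping from text labels to $\{-1,+1\}$ can be checked on a toy list; a minimal sketch (LabelEncoder assigns integer codes in sorted label order, so 'A' becomes 0 and 'B' becomes 1):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder encodes classes in sorted order: 'A' -> 0, 'B' -> 1.
toy = np.array(['B', 'B', 'A', 'B'])
encoded = LabelEncoder().fit_transform(toy)  # [1, 1, 0, 1]
signed = 2 * encoded - 1                     # [1, 1, -1, 1]
print(signed)
```

So 'A' maps to -1 and 'B' to +1, which is the convention used for y below.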

Let's check X, y_text and y:

In [4]:
X.shape
Out[4]:
(292, 2)
In [5]:
y_text
Out[5]:
array(['B', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'A', 'B',
       'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B',
       'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B',
       'A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'B', 'A', 'A',
       'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B',
       'B', 'A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'A', 'A', 'B',
       'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B',
       'B', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'A',
       'A', 'B', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B',
       'B', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B',
       'B', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'B',
       'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'A',
       'B', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B',
       'B', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A',
       'A', 'B', 'A', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B',
       'A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A', 'A',
       'A', 'B', 'A', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B',
       'B', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'A', 'B',
       'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'B', 'A', 'A', 'A',
       'B', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'B',
       'B', 'B', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'B', 'B', 'A', 'B',
       'B', 'B', 'B', 'A', 'A', 'B'], dtype=object)
In [6]:
y
Out[6]:
array([ 1,  1, -1,  1,  1,  1, -1,  1,  1, -1, -1, -1,  1,  1, -1, -1, -1,
        1, -1,  1, -1, -1,  1,  1,  1,  1, -1,  1, -1, -1,  1,  1,  1, -1,
       -1,  1,  1,  1,  1, -1, -1,  1,  1, -1,  1, -1,  1, -1,  1,  1, -1,
       -1, -1, -1, -1,  1,  1, -1, -1, -1, -1, -1, -1,  1,  1,  1, -1, -1,
       -1,  1,  1, -1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,  1, -1,  1,
        1,  1, -1,  1, -1,  1,  1, -1,  1, -1,  1, -1, -1, -1,  1, -1, -1,
       -1, -1, -1,  1,  1, -1, -1, -1,  1, -1,  1, -1, -1,  1,  1,  1, -1,
        1,  1,  1,  1, -1, -1,  1,  1, -1, -1,  1,  1,  1,  1, -1, -1, -1,
       -1,  1, -1, -1,  1, -1,  1, -1,  1,  1,  1,  1,  1, -1, -1, -1, -1,
       -1,  1, -1,  1, -1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1,  1,  1,
        1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  1, -1, -1,  1, -1, -1, -1,
       -1,  1, -1,  1, -1, -1,  1,  1, -1,  1, -1, -1,  1,  1, -1, -1,  1,
       -1,  1, -1, -1, -1,  1, -1,  1,  1, -1,  1,  1, -1, -1, -1, -1,  1,
        1, -1,  1, -1,  1, -1, -1, -1, -1, -1,  1, -1,  1,  1,  1, -1, -1,
        1,  1,  1,  1, -1,  1, -1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1,
       -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,
       -1, -1,  1])

Scatter plotting the data:

In [7]:
points_colorscale = [
                     [0.0, 'rgb(239, 85, 59)'],
                     [1.0, 'rgb(99, 110, 250)'],
                    ]

points = go.Scatter(
                    x=df['Feature 1'],
                    y=df['Feature 2'],
                    mode='markers',
                    marker=dict(color=y,
                                colorscale=points_colorscale)
                   )
layout = go.Layout(
                   xaxis=dict(range=[-1.05, 1.05]),
                   yaxis=dict(range=[-1.05, 1.05])
                  )

fig = go.Figure(data=[points], layout=layout)
fig.show()

Splitting data

It's time to split data into training, validation and test sets. Let's use 60% for training, 20% for validation and 20% for test data.

In [8]:
(X_train, X_vt, y_train, y_vt) = train_test_split(X, y, test_size=0.4, random_state=0)
(X_validation, X_test, y_validation, y_test) = train_test_split(X_vt, y_vt, test_size=0.5, random_state=0)
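The two-stage split above yields roughly 60/20/20. Since train_test_split rounds the test fraction up, on 292 rows the exact sizes work out to 175/58/59; a quick sanity check on dummy data of the same shape:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data with the same number of rows as the dataframe (292).
X_demo = np.zeros((292, 2))
y_demo = np.zeros(292)

X_tr, X_vt_demo, y_tr, y_vt_demo = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_vt_demo, y_vt_demo, test_size=0.5, random_state=0)

# ceil(292 * 0.4) = 117 held out, then ceil(117 * 0.5) = 59 for test.
print(len(X_tr), len(X_val), len(X_te))  # 175 58 59
```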

Building and visualizing a SVM

I'll use the SVC class from scikit-learn. For now I'm not using a nonlinear kernel, so I set the kernel argument of SVC to 'linear'.

In [9]:
svm = SVC(kernel='linear')
In [10]:
# fit svm to X_train and y_train
svm.fit(X_train, y_train)
Out[10]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
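For a linear kernel, decision_function(x) is simply $w \cdot x + b$, with $w$ in svm.coef_ and $b$ in svm.intercept_. A sketch verifying this on made-up linearly separable data (the toy data here is purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly separable toy data.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

clf = SVC(kernel='linear').fit(X_toy, y_toy)

# Manual decision function: X @ w.T + b
manual = X_toy @ clf.coef_.T + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(X_toy)))  # True
```

This is also why the decision surface below is a straight line: it is the set of points where $w \cdot x + b = 0$, and the dashed margins sit at $\pm 1$.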

Here's a function to plot the decision surface:

In [12]:
def svm_show(svm):
    decision_colorscale = [
                           [0.0, 'rgb(239,  85,  59)'],
                           [0.5, 'rgb(  0,   0,   0)'],
                           [1.0, 'rgb( 99, 110, 250)']
                          ]

    detail_steps = 100

    (x_vis_0_min, x_vis_1_min) = (-1.05, -1.05) #X_train.min(axis=0)
    (x_vis_0_max, x_vis_1_max) = ( 1.05,  1.05) #X_train.max(axis=0)

    x_vis_0_range = np.linspace(x_vis_0_min, x_vis_0_max, detail_steps)
    x_vis_1_range = np.linspace(x_vis_1_min, x_vis_1_max, detail_steps)

    (XX_vis_0, XX_vis_1) = np.meshgrid(x_vis_0_range, x_vis_1_range)

    X_vis = np.c_[XX_vis_0.reshape(-1), XX_vis_1.reshape(-1)]

    YY_vis = svm.decision_function(X_vis).reshape(XX_vis_0.shape)

    points = go.Scatter(
                        x=df['Feature 1'],
                        y=df['Feature 2'],
                        mode='markers',
                        marker=dict(
                                    color=y,
                                    colorscale=points_colorscale),
                        showlegend=False
                       )
    SVs = svm.support_vectors_
    support_vectors = go.Scatter(
                                 x=SVs[:, 0],
                                 y=SVs[:, 1],
                                 mode='markers',
                                 marker=dict(
                                             size=15,
                                             color='black',
                                             opacity = 0.1,
                                             colorscale=points_colorscale),
                                 line=dict(dash='solid'),
                                 showlegend=False
                                )

    decision_surface = go.Contour(x=x_vis_0_range,
                                  y=x_vis_1_range,
                                  z=YY_vis,
                                  contours_coloring='lines',
                                  line_width=2,
                                  contours=dict(
                                                start=0,
                                                end=0,
                                                size=1),
                                  colorscale=decision_colorscale,
                                  showscale=False
                                 )

    margins = go.Contour(x=x_vis_0_range,
                         y=x_vis_1_range,
                         z=YY_vis,
                         contours_coloring='lines',
                         line_width=2,
                         contours=dict(
                                       start=-1,
                                       end=1,
                                       size=2),
                         line=dict(dash='dash'),
                         colorscale=decision_colorscale,
                         showscale=False
                        )

    fig2 = go.Figure(data=[margins, decision_surface, support_vectors, points], layout=layout)
    return fig2.show()

Let's visualize the decision surface of the svm with its support vectors:

In [13]:
svm_show(svm)

The datapoints, the decision surface (which is a line here), the margins and the support vectors are shown in the plot.

Kernels

As we can see in the plot, the linear decision surface is underfitting the data. Let's use a polynomial kernel instead. I define svm_p2 as an instance of SVC, this time with the arguments kernel='poly' and degree=2 to get a degree-2 polynomial kernel:
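Under the hood, the degree-2 polynomial kernel computes $K(x, z) = (\gamma\, x^\top z + r)^d$, where $r$ is SVC's coef0 argument. A sketch comparing a manual computation against scikit-learn's pairwise helper (gamma and coef0 are fixed explicitly here to avoid depending on version-specific defaults):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))
B = rng.normal(size=(3, 2))
gamma, coef0, degree = 0.5, 0.0, 2

# K(x, z) = (gamma * x.z + coef0) ** degree, for every pair of rows.
manual = (gamma * (A @ B.T) + coef0) ** degree
print(np.allclose(manual, polynomial_kernel(A, B, degree=degree, gamma=gamma, coef0=coef0)))  # True
```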

In [14]:
svm_p2 = SVC(kernel='poly', degree=2)
In [15]:
# fit it to your training data:
svm_p2.fit(X_train, y_train)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning:

The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.

Out[15]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=2, gamma='auto_deprecated',
    kernel='poly', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [16]:
# visualize this model with the function svm_show
svm_show(svm_p2)

Looks better. But let's also try a degree-3 model, svm_p3, with degree=3 this time:

In [17]:
svm_p3 = SVC(kernel='poly', degree=3)
In [18]:
svm_p3.fit(X_train, y_train)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning:

The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.

Out[18]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='poly', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [19]:
svm_show(svm_p3)

Let's try an RBF (Radial Basis Function) kernel as well. RBF is the default kernel for scikit-learn's SVC, so for the model svm_r I can simply omit the kernel argument (the degree argument is ignored by non-polynomial kernels, so I skip that too):
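The RBF kernel computes $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$; a sketch checking a manual computation against scikit-learn's pairwise helper (gamma is fixed explicitly for the comparison):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 2))
B = rng.normal(size=(3, 2))
gamma = 1.0

# Squared Euclidean distance between every pair of rows, via broadcasting.
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
manual = np.exp(-gamma * sq_dists)
print(np.allclose(manual, rbf_kernel(A, B, gamma=gamma)))  # True
```

Because the kernel decays with distance, each support vector influences only its neighborhood, which is what lets the decision surface curve freely around clusters of points.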

In [20]:
svm_r = SVC()
In [21]:
svm_r.fit(X_train,y_train)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\svm\base.py:193: FutureWarning:

The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.

Out[21]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [22]:
svm_show(svm_r)

Model selection

How do we pick the best model then? I'll use the validation data. For each model I'll predict on X_train and assign the result a name like yhat_train, and likewise predict on X_validation and assign it a name like yhat_validation (how close the accuracies on these two datasets are will be a helpful overfitting check):

In [23]:
models = [svm,svm_p2,svm_p3,svm_r]
names_y_train = ['yhat_train','yhat_train_p2','yhat_train_p3','yhat_train_r']
names_y_validation = ['yhat_validation','yhat_validation_p2','yhat_validation_p3','yhat_validation_r']

for a,b,c in zip(models, names_y_train, names_y_validation):  
    globals()[b] = a.predict(X_train)
    globals()[c] = a.predict(X_validation)
print('accuracy_score(yhat_train, y_train), accuracy_score(yhat_validation, y_validation)\n',
      accuracy_score(yhat_train, y_train),'\t\t\t',
      accuracy_score(yhat_validation, y_validation),'\n')
print('accuracy_score(yhat_train_p2, y_train), accuracy_score(yhat_validation_p2, y_validation)\n',
      accuracy_score(yhat_train_p2, y_train),'\t\t\t',
      accuracy_score(yhat_validation_p2, y_validation),'\n')
print('accuracy_score(yhat_train_p3, y_train), accuracy_score(yhat_validation_p3, y_validation)\n',
      accuracy_score(yhat_train_p3, y_train),'\t\t\t',
      accuracy_score(yhat_validation_p3, y_validation),'\n')
print('accuracy_score(yhat_train_r, y_train), accuracy_score(yhat_validation_r, y_validation)\n',
      accuracy_score(yhat_train_r, y_train),'\t\t\t',
      accuracy_score(yhat_validation_r, y_validation),'\n')
accuracy_score(yhat_train, y_train), accuracy_score(yhat_validation, y_validation)
 0.9142857142857143 			 0.9482758620689655 

accuracy_score(yhat_train_p2, y_train), accuracy_score(yhat_validation_p2, y_validation)
 0.5371428571428571 			 0.46551724137931033 

accuracy_score(yhat_train_p3, y_train), accuracy_score(yhat_validation_p3, y_validation)
 0.5828571428571429 			 0.5172413793103449 

accuracy_score(yhat_train_r, y_train), accuracy_score(yhat_validation_r, y_validation)
 0.9314285714285714 			 0.9827586206896551 
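The globals() trick above works, but collecting models and scores in dictionaries is tidier. A self-contained sketch of the same comparison on synthetic stand-in data (the model names are just labels for printing, and the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.where((X_demo ** 2).sum(axis=1) > 1.5, 1, -1)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

models = {
    'linear': SVC(kernel='linear'),
    'poly2':  SVC(kernel='poly', degree=2),
    'poly3':  SVC(kernel='poly', degree=3),
    'rbf':    SVC(),
}

# Fit each model and record (train accuracy, validation accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (accuracy_score(y_tr, model.predict(X_tr)),
                    accuracy_score(y_val, model.predict(X_val)))

for name, (tr, val) in scores.items():
    print(f'{name}: train={tr:.3f}  validation={val:.3f}')
```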

From all these numbers we can see that the RBF model works best: its accuracy on the validation data is the highest, and the gap between training and validation accuracy is small. We could further tune the generalization power of our model via the C argument of SVC, which is inversely proportional to the regularization strength.

Final assessment

Finally, let's check accuracy on the test data to get a final performance number. Predict yhat_test_r from X_test on svm_r:

In [24]:
yhat_test_r = svm_r.predict(X_test)
accuracy_score(yhat_test_r, y_test)
Out[24]:
0.9491525423728814

We get good performance on the test data.